Abstract: We describe a non-parametric, "example-based" method for estimating the depth of an object, viewed in a single photo. 
Our method consults a database of example 3D geometries, searching for those which look similar to the object in the photo. The 
known depths of the selected database objects act as shape priors which constrain the process of estimating the object's depth. We 
show how this process can be performed by optimizing a well-defined target likelihood function, via a hard-EM procedure. We address 
the problem of representing the (possibly infinite) variability of viewing conditions with a finite (and often very small) example set, by 
proposing an on-the-fly example update scheme. We further demonstrate the importance of non-stationarity in avoiding misleading 
examples when estimating structured shapes. We evaluate our method and present both qualitative and quantitative results for 
challenging object classes. Finally, we show how this same technique may be readily applied to a number of related problems. These 
include the novel task of estimating the occluded depth of an object's backside and the task of tailoring custom fitting image-maps for 
input depths. 



1 Introduction 

THE human visual system is remarkably adept at 
estimating the shapes of objects from just a single 
view, despite this being an ill-posed problem; many 
different shapes can appear the same in an image and 
any one of them is as plausible as the next. To overcome 
this difficulty, existing computational methods routinely 
make a-priori assumptions on the lighting properties, the 
object's surface properties, the structure of the scene, and 
more. 

Here, we make the following alternative assumption: 
We assume that the object viewed is roughly similar in 
shape (but by no means identical) to the shapes of a 
set of related objects. The obvious example here is of 
course faces. As we will later show, examples of typical 
face shapes can be used to estimate even highly unusual 
faces. We claim that the same is true for other object 
classes and indeed demonstrate results for images of 
challenging objects, including hands, full body figures 
(non-rigid objects), and fish (highly textured objects). 

Specifically, we assume that we have at our disposal 
a database of relevant example 3D geometries. We can 
easily obtain the appearances of these objects, viewed 
under any desired viewing condition, by using stan- 
dard rendering techniques. To estimate the shape of a 
novel object from a single image we search through the 
appearances of these objects, possibly rendering new 
appearances as we go, looking for ones appearing similar 
to the input image. Once found, the known depths of 
these selected objects serve as priors for the object's 
shape. We perform this task at the patch level, thus 
obtaining depth estimates very different from those in 
the database. This process is performed via a hard-EM 
procedure, optimizing a well-defined target likelihood 
function representing the likelihood of the estimated 
depth given the input image and the set of examples. 

This approach to depth estimation has a number of 
advantages: (1) Our method is non-parametric, and as 
such, requires no a-priori model selection or design. 
Consequently, (2) it is versatile. As we will show, the 
same method is used to estimate the shapes of ob- 
jects belonging to very different object classes and even 
to solve additional related tasks. Finally, (3) a data- 
driven approach requires making no assumptions on the 
properties of the object in the image nor the viewing 
conditions. Our chief requirement is the existence of a 
suitable set of 3D examples. We believe this to be a 
reasonable requirement given the growing availability 
of such databases. 

Obviously, in taking an example-based approach to 
depth estimation, we have no guarantee that the example 
data sets we use contain objects sufficiently similar to the 
one in the input image. We therefore follow the example 
of methods such as [1], [2], [3] in seeking to produce 
plausible depth estimates and not necessarily the true 
depths. Here, however, the concept of a plausible depth 
is formally defined by our target function. Moreover, we 
present quantitative results suggesting that our method 
is indeed capable of producing accurate estimates even 
for challenging objects, given an adequate example set. 

To summarize, this report reviews the following top- 
ics. 

• Example-based approach to depth estimation. We 
describe an approach to single-view depth estima- 
tion and present both qualitative and quantitative 
results on a number of challenging object classes. 
We have tested our method on large sets of objects, 
including real and synthetic images of objects with 
arbitrary texture, pose, and genus, viewed under 
unconstrained viewing conditions. 

• On-the-fly example update scheme. We augment 
existing example-based methods by arguing that 
examples need not be selected a-priori. To handle the 
possibly infinite viewing conditions and postures of 
the objects being reconstructed we produce better 
suited examples while removing less adequate ones 
on-the-fly, as part of the reconstruction process. 

• Non-stationarity for structured shape recon- 
struction. We emphasize the importance of non- 
stationarity in avoiding depth ambiguities and mak- 
ing better example selections. 

• Additional applications. We show how the same 
method used for depth estimation may also be used 
for the additional tasks of estimating the depths of 
the occluded backside of objects viewed in an image 
as well as estimating the colors of objects from their 
shape. 

The rest of this report is organized as follows. In the next 
section we review related work. Our depth estimation 
framework is described in Sec. 3. Our example update 
scheme is presented in Sec. 4, followed by a discussion on 
non-stationarity in Sec. 5. We propose additional 
applications, based on our method, in Sec. 6. Implementation 
and results are presented in Sec. 7. Finally, we conclude 
in Sec. 8. 



2 Related work 

Depth estimation. There is an immense volume of liter- 
ature on the problem of estimating the shapes of objects 
or scenes from a single image. Indeed, this problem 
is considered to be one of the classical challenges of 
Computer Vision. Methods for single image reconstruc- 
tion very often rely on different cues such as shading, 
silhouette shapes, texture, and vanishing points (e.g., [1], 
[4], [5], [6], [7], [8]). These methods restrict the allowable 
reconstructions by placing constraints on the proper- 
ties of reconstructed objects (e.g., reflectance properties, 
viewing conditions, and symmetry). 

There has been a growing interest in producing depth 
estimates for large-scale outdoor scenes from single 
images. One approach [2], [9] reconstructs outdoor scenes 
assuming they can be labeled as "ground," "sky," and 
"vertical" billboards. Other approaches include the 
Diorama construction method of [1] and the Make3D system 
of [10]. Although both visually pleasing and 
quantitatively accurate estimates have been demonstrated, it is 
unclear how to extend these methods to classes other 
than outdoor scenes. 

Recently, a growing number of methods explicitly 
use examples to guide the reconstruction 
process. One notable approach makes the assumption 
that all 3D objects in the class being modeled lie in a 
linear space spanned using a few basis objects (e.g., [11], 
[12], [13], [14]). This approach is applicable to faces, but 
it is less clear how to extend it to more variable classes 
because it requires dense correspondences between 
surface points across examples. Another approach [15] 
uses a single example to produce accurate, 
shape-from-shading estimates of face shapes. This approach too is 
tailored for the particular problem of estimating face 
shapes. By contrast, our chief assumption is that the 
object viewed in the query image has similar looking 
counterparts in our example set, and so our method can 
be applied to produce depth estimates for a range of 
different object classes. 



Synthesis "by-example". A fully data-driven method 
was first proposed in [16], inspired by methods for 
constructing 3D models by combining parts [17]. It 
operates by assuming a collection of example, reference 
images, along with known 3D shapes (depths). For a 
given query image, it seeks to match the appearance 
of the query to the appearance of these references, and 
produces a depth estimate by combining the known 
reference depth values, associated with the matching 
appearances. This report elaborates on the original 
method described in [16] and provides additional 
information and results compared to that paper. 



Shape decoration. Sec. 6 demonstrates how our 
framework can be applied to solve additional problems 
beyond shape reconstruction. In particular, we 
demonstrate the use of our method for automatically colorizing 
depth-maps, as a quick means for decorating 3D shapes 
(Sec. 6.2). Existing automatic methods for decorating 3D 
models have mostly focused on covering the surface of 
3D models with texture examples (e.g., [18], [19], [20]). 
We note in particular a work concurrent to our own [21], 
which uses an optimization procedure similar to the one 
used here. Their goal, however, is to cover 3D surfaces 
with 2D texture examples. Finally, recent methods have 
attempted to allow the modeler semi-automatic means 
of producing non-textural image-maps (e.g., [22]). These 
methods rely on the user forming explicit correspon- 
dences between parts of the 3D surface, and different 
texture examples, which are then merged together to 
produce the complete image-map for an input 3D model. 
Our work, on the other hand, is fully automatic. 

Finally, there have been a number of publications pre- 
senting methods for 3D model (e.g., triangulated mesh) 
correspondences and cross parameterization (e.g., [23], 
[24]). These methods establish correspondences across 
two or more 3D surfaces. Once these correspondences 
are computed, surface properties such as texture, can be 
transferred from one corresponding 3D object to another, 
thus providing a novel model with a custom image-map. 
These methods, however, often require a human modeler 
to input a seed set of correspondences across the models, 
or else assume the models are similar in general form. 
In our colorization method, no prior correspondences are 
required, the process is fully automatic, and the models 
need only be locally similar. 

Fig. 1. Visualization of our process. (a) The input image. (b) Step (i) finds for every query patch a similar patch in 
the database. Each patch provides depth estimates for the pixels it covers. Thus, overlapping patches provide several 
depth estimates for each pixel. These values are combined at each pixel to produce a new estimate for that pixel's 
depth in Step (ii). This process is then repeated until convergence (Step (iii)) by returning to Step (i), now searching 
for patches matching in both intensity and depth, using the current depth estimate for the comparison. (c) Our final 
depth estimate. 



3 Estimating depth from examples 

Given a query image I of some object of a certain class, 
our goal is to estimate a depth map D for the object. 
To this end we use examples of feasible mappings from 
intensities (appearances) to depths for the class. These 
mappings are given in a database 
$S = \{M_i\}_{i=1}^{n} = \{(I_i, D_i)\}_{i=1}^{n}$, 
where $I_i$ and $D_i$ respectively are the image 
and the depth-map of an object from the class. These 
image-depth pairs are produced by applying standard 
rendering techniques to a set of example, textured, 3D 
geometries. For simplicity we assume first that all the 
image-depth pairs in the database were produced by 
rendering the geometries from a single viewing 
direction, shared also with the input image. Later, in Sec. 4, we 
relax this assumption by demonstrating how an estimate 
of the camera pose may be recovered along with the 
depth. 

Our goal is to produce a depth D such that every $k \times k$ 
patch of mappings in M = (I, D) will have a similar 
counterpart in S (i.e., will be feasible). Specifically, we 
seek a depth D satisfying the following two criteria: 

1) For every $k \times k$ patch of mappings in M, there is 
a similar patch in S, and 

2) if two patches overlap a pixel p, then the two 
database patches selected as their matches must 
agree on the depth at p. 

We next describe how we produce depth estimates sat- 
isfying these criteria. 

3.1 Optimization scheme 

Given an input image I, we produce a depth estimate D 
meeting the two criteria mentioned above, by building 
on the following simple, two-step procedure (see also 
Fig. 1): (i) At every location p in I we consider a $k \times k$ 
window around p and seek a matching window in the 
database with a similar intensity pattern in the least 
squares sense (Fig. 1, Step (i)). (ii) Finding such a window, we 
extract its corresponding $k \times k$ depths. We do this for all 
pixels in I, matching overlapping intensity patterns and 
obtaining $k^2$ depth estimates for every pixel coordinate. 
The depth value at every p is then determined by 
taking a Gaussian weighted mean of these $k^2$ estimates 
(Fig. 1, Step (ii)). Here, the Gaussian weights weigh in favor 
of estimates from patches centered closer to p. 

Of course, there is nothing to guarantee that the depth 
estimate obtained by executing these two steps just once 
will meet our criteria. In order to produce a suitable 
estimate, we therefore take the current depth to be an 
initial guess which we then refine iteratively. We repeat 
the following process until convergence (see also Fig. 2): 
At every step we seek for every patch in M, a database 
patch similar in both intensity and depth, using 
D from the previous iteration for the comparison. Thus, 
unlike the initial step, we now look for similar mappings. 
Having found new matches, we compute a new depth 
estimate for each pixel as before, by taking the Gaussian 
weighted mean of its $k^2$ estimates. In Sec. 3.2 we 
prove that this two-step procedure is a hard-EM 
optimization of a well-defined target function. As such, it is 
guaranteed to converge to a local optimum of the target 
function. 

Fig. 2 summarizes this process. The function 
getSimilarPatches searches S for patches of mappings 
which match those of M, in the least squares sense. 
The set of all such matching patches is denoted V. The 
function updateDepths then updates the depth estimate 
D at every pixel p by taking the weighted mean over 
all depth values for p in V. 



D = estimateDepth(I, S) 

    M = (I, ?) 

    repeat until no change in M 
        (i)  V = getSimilarPatches(M, S) 
        (ii) D = updateDepths(M, V) 
             M = (I, D) 

Fig. 2. Summary of the basic steps of our algorithm. 
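
The loop of Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: images and depth maps are small 2D NumPy arrays, the patch search is brute force, the names `estimate_depth`, `extract_patches`, `gaussian_window`, `depth_weight`, and `sigma` are hypothetical, intensity normalization and the pyramid representation of Sec. 7.1 are omitted, and "no change in M" is approximated by "no change in the selected patches".

```python
import numpy as np

def extract_patches(img, k):
    """Return all k x k patches of a 2D array as rows, plus their top-left corners."""
    h, w = img.shape
    corners = [(y, x) for y in range(h - k + 1) for x in range(w - k + 1)]
    rows = np.stack([img[y:y + k, x:x + k].ravel() for y, x in corners])
    return rows, corners

def gaussian_window(k, sigma):
    """k x k Gaussian weights, largest at the patch center."""
    ax = np.arange(k) - (k - 1) / 2.0
    g = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    return np.outer(g, g)

def estimate_depth(image, database, k=3, depth_weight=1.0, max_iters=10, sigma=1.0):
    """Sketch of Fig. 2 for a query image and a database [(I_i, D_i), ...]."""
    db_int = np.concatenate([extract_patches(Ii, k)[0] for Ii, _ in database])
    db_dep = np.concatenate([extract_patches(Di, k)[0] for _, Di in database])
    q_int, corners = extract_patches(image, k)
    win = gaussian_window(k, sigma)
    D, prev = None, None
    for _ in range(max_iters):
        # Step (i): squared distance from every query patch to every database patch.
        cost = ((q_int[:, None, :] - db_int[None, :, :]) ** 2).sum(-1)
        if D is not None:  # after the first pass, compare depths as well
            q_dep, _ = extract_patches(D, k)
            cost += depth_weight * ((q_dep[:, None, :] - db_dep[None, :, :]) ** 2).sum(-1)
        match = cost.argmin(axis=1)
        # Step (ii): Gaussian-weighted mean of the overlapping depth estimates.
        num = np.zeros(image.shape)
        den = np.zeros(image.shape)
        for (y, x), m in zip(corners, match):
            num[y:y + k, x:x + k] += win * db_dep[m].reshape(k, k)
            den[y:y + k, x:x + k] += win
        D = num / den
        if prev is not None and np.array_equal(match, prev):
            break  # the selected patches no longer change
        prev = match
    return D
```

The brute-force search costs memory proportional to the number of query patches times the number of database patches, so the sketch is only practical for toy inputs; any approximate nearest-neighbor structure could replace it without changing the loop.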



3.2 Plausibility as a likelihood function 

We now analyze our iterative process and show that it 
is in fact a hard-EM optimization [25] of the following 
target function (which in turn, satisfies our criteria of 
Sec. 3). Denote by $W_p$ a $k \times k$ window from the query 
M centered at p, containing both intensity values and 
(unknown) depth values, and denote by V a similar 
window in some $M_i \in S$. Our target function can now 
be defined as 

$$\mathrm{Plaus}(D \mid I, S) = \prod_{p \in I} \max_{V \in S} \mathrm{Sim}(W_p, V), \qquad (1)$$

with the similarity measure $\mathrm{Sim}(W_p, V)$ being: 

$$\mathrm{Sim}(W_p, V) = \exp\left(-\tfrac{1}{2} (W_p - V)^T \Sigma^{-1} (W_p - V)\right), \qquad (2)$$

where $\Sigma$ is a constant diagonal matrix, its components 
representing the individual variances of the intensity 
and depth components of patches in the class. These 
are provided by the user as weights (see also Sec. 7.1). 
To make this norm robust to illumination changes we 
normalize the intensities in each window to have zero 
mean and unit variance, similarly to the normalization 
often applied to patches in detection and recognition 
methods (e.g., [26]). 
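
As an illustration of Eq. (2), the following sketch evaluates the similarity of two patches under the simplifying assumption that the diagonal covariance holds just two values, one shared by all intensity entries and one shared by all depth entries. The function names and the parameters `var_int` and `var_dep` are hypothetical, and the handling of constant patches is our own choice.

```python
import numpy as np

def normalize_intensities(patch):
    """Shift and scale a patch's intensities to zero mean and unit variance."""
    patch = np.asarray(patch, dtype=float)
    std = patch.std()
    if std == 0:  # flat patch: only the mean can be removed
        return patch - patch.mean()
    return (patch - patch.mean()) / std

def sim(w_int, w_dep, v_int, v_dep, var_int=1.0, var_dep=1.0):
    """Eq. (2) with a diagonal covariance: one variance per component type."""
    d_int = normalize_intensities(w_int) - normalize_intensities(v_int)
    d_dep = np.asarray(w_dep, dtype=float) - np.asarray(v_dep, dtype=float)
    quad = (d_int ** 2).sum() / var_int + (d_dep ** 2).sum() / var_dep
    return np.exp(-0.5 * quad)
```

Because of the normalization, a patch and an affinely brightened copy of it (for example `2 * patch + 5`) receive similarity one when their depths agree, which is the robustness to illumination changes described above.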
Fig. 3. Graphical model representation. Please see text 
for more details. 

In Fig. 3 we represent the intensities of the query 
image I as observables and the matching database patches 
V and the sought depth values D as hidden variables. 
The joint probability of the observed and hidden 
variables can be formulated through the edge potentials by 

$$f(I, V; D) = \prod_{p \in I} \prod_{q \in W_p} \phi_I(V_p(q), I(q)) \cdot \phi_D(V_p(q), D(q)),$$

where $V_p$ is the database patch matched with an image 
patch $W_p$ centered at p by the global assignment V. 
Taking $\phi_I$ and $\phi_D$ to be Gaussians with different covariances 
over the appearance and depth respectively, implies 

$$f(I, V; D) = \prod_{p \in I} \mathrm{Sim}(W_p, V_p),$$

where Sim is defined in Eq. (2). Integrating over all possible 
assignments of V we obtain the likelihood function 

$$L = f(I; D) = \sum_{V} f(I, V; D) = \sum_{V} \prod_{p \in I} \mathrm{Sim}(W_p, V_p).$$

We approximate the sum with a maximum operator. 
Note that this is common practice for EM algorithms, 
often referred to as hard-EM (e.g., [25]). Since similarities 
can be computed independently, we can interchange the 
product and maximum operators, obtaining the 
following maximum likelihood: 

$$\max L \approx \prod_{p \in I} \max_{V \in S} \mathrm{Sim}(W_p, V) = \mathrm{Plaus}(D \mid I, S),$$

which is our cost function (1). 

The function estimateDepth (Fig. 2) maximizes this 
measure by implementing a hard-EM optimization. The 
function getSimilarPatches performs a hard E-step by 
selecting the set of assignments $V^{t+1}$ for time $t+1$ which 
maximizes the posterior: 

$$f(V^{t+1} \mid I, D^{t}) \propto \prod_{p \in I} \mathrm{Sim}(W_p, V_p).$$

Here, $D^{t}$ is the depth estimate at time t. Due to the 
independence of patch similarities, this can be maximized 
by finding for each patch in M the most similar patch 
in the database, in the least squares sense. 

The function updateDepths approximates the M-step 
by finding the most likely depth assignment at each 
pixel: 

$$D^{t+1}(p) = \arg\max_{D(p)} \left( - \sum_{q \in W_p} \left( D(p) - \mathrm{depth}(V_q^{t+1}(p)) \right)^2 \right).$$

This is maximized by taking the mean depth value 
over all $k^2$ estimates $\mathrm{depth}(V_q^{t+1}(p))$, for all neighboring 
pixels q. 

We note that hard-EM optimization is well known 
to converge to a local optimum of the target 
function [25]. 
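
As a brief check of the M-step performed by updateDepths, the per-pixel objective can be maximized in closed form. In the sketch below, $d_q$ is our shorthand for the depth proposed for pixel p by the patch matched at q, and $w_q \ge 0$ are optional weights; both symbols are introduced here only for illustration.

$$\frac{\partial}{\partial D(p)} \left( - \sum_{q \in W_p} w_q \left( D(p) - d_q \right)^2 \right) = -2 \sum_{q \in W_p} w_q \left( D(p) - d_q \right) = 0 \;\;\Longrightarrow\;\; D(p) = \frac{\sum_{q \in W_p} w_q \, d_q}{\sum_{q \in W_p} w_q}.$$

The objective is concave in $D(p)$, so this stationary point is its maximizer. With $w_q = 1$ it is the plain mean of the $k^2$ estimates; with Gaussian $w_q$ it is the Gaussian weighted mean described in Sec. 3.1.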



4 Finding the right examples 

By-example, patch based approaches have become quite 
popular and are successfully employed for solving prob- 
lems ranging from texture synthesis to recognition. The 
underlying assumption behind these methods is that 
class variability can be captured by a finite, preferably 
small, set of examples. Many applications can typically 
guarantee these conditions (notably texture synthesis). 
However, when the examples include non-rigid objects, 
objects varying in texture, or when viewing conditions 
are allowed to change, it becomes increasingly harder 
to apply these methods: Adding more examples to 
allow more variability (e.g., rotations of the input image 
in [27]), implies larger storage requirements, longer 
running times, and higher risk of false matches. 

Our goal here is to handle objects viewed from any di- 
rection, non-rigid objects (e.g. hands), and objects which 
vary in texture (e.g. fish). Ideally, we would like to 
use few examples whose shape (depth) is similar to 
that of the object in the input image, viewed under 
similar conditions. This, however, implies a chicken-and- 
egg problem: Depth estimation requires choosing similar 
example objects, but knowing which objects are similar 
first requires a depth estimate. 

Our optimization scheme provides a convenient 
means of solving this problem. Instead of committing 
beforehand to a fixed set of examples we update the 
set of examples, on-the-fly, alongside the optimization 
process. We start with an initial seed database of ex- 
amples. In subsequent iterations of our optimization we 
drop the least used examples Mi from our database, 
replacing them with ones deemed better suited for the 
depth estimation process. These are produced by on-the- 
fly rendering of more suitable 3D models, with viewing 
conditions closer to the one used in the query. In our 
experiments, we applied this idea to search for more 
similar example objects and better viewing angles. We 
believe that other parameters such as lighting conditions 
can also be similarly resolved. We next describe the 
details of our implementation. 

4.1 Searching for the best views 

Fig. 4 demonstrates a depth estimation result produced 
by using example images generated from a single 
incorrect viewing angle (Fig. 4a) and four fixed, widely 
spaced viewing angles (Fig. 4b). Both results are 
inadequate. 

It stands to reason that mappings from viewing angles 
closer to the true one, will contribute more patches to 
the process than those further away. We thus adopt 
the following scheme. We start with a small number 
of pre-selected views, sparsely covering parts of the 
viewing sphere (the gray cameras in Fig. 4c). The seed 
database S is produced by taking the mappings $M_i$ of 
our objects, rendered from these views, and is used to 
obtain an initial depth estimate. In subsequent iterations, 
we re-estimate our views by taking the mean of the 
currently used angles, linearly weighted by the relative 
number of patches selected from each angle. We then 
drop from S mappings originating from the least used 
angle and replace them with ones from the new view. If 
the new view is sufficiently close to one of the remaining 
angles (e.g., its distance to an existing view falls below 
a predefined threshold), we instead increase the number 
of objects to maintain the size of S. Fig. 4c presents a 
result obtained with our angle update scheme. 
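
One angle update step can be sketched as follows. This is an illustration under assumptions rather than our implementation: views are (α, β) pairs in degrees that are averaged linearly (which ignores angular wrap-around and is only reasonable for nearby views), the distance between views is taken to be Euclidean in angle space, and `update_views`, `patch_counts`, and `min_separation` are hypothetical names.

```python
import numpy as np

def update_views(views, patch_counts, min_separation=5.0):
    """One angle update step.

    views: list of (alpha, beta) in degrees.
    patch_counts: number of patches selected from each view.
    Returns (new_views, added); added is False when the new view was too
    close to a remaining one, in which case the caller may add objects instead.
    """
    views = np.asarray(views, dtype=float)
    counts = np.asarray(patch_counts, dtype=float)
    weights = counts / counts.sum()
    new_view = weights @ views            # mean angle, weighted by usage
    keep = np.ones(len(views), dtype=bool)
    keep[counts.argmin()] = False         # drop the least used view
    remaining = views[keep]
    dists = np.linalg.norm(remaining - new_view, axis=1)
    if dists.min() < min_separation:
        return remaining, False
    return np.vstack([remaining, new_view]), True
```

For example, with the four starting views of Fig. 4b and hypothetical patch counts of 30, 10, 40, and 20, the step drops the second view and adds the weighted mean (-8°, -24°).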

Although methods exist which accurately estimate the 
viewing angle [28], [29], we preferred embedding this 
estimation in our optimization. To understand why, 
consider non-rigid classes such as the human body where 
posture cannot be captured with only a few parameters. 
Our approach uses information from several viewing 
angles simultaneously, without pre-committing to any 
single view. 

Fig. 4. Depth estimate with unknown viewing 
angle. A woman's face viewed from camera angle 
(α, β) = (0°, -22°). (a) Database mappings S 
rendered with the camera at angle (0°, 0°). (b) Database 
generated with cameras positioned in angles (-20°, 0°), 
(20°, 0°), (-20°, -40°), and (20°, -40°), without updating 
the database viewing position. (c) Estimating depth while 
updating the database camera view on-the-fly. Starting 
from the angles in (b), now updating angles until 
convergence to (-6°, -24°). 

4.2 Searching for the best example objects 

Although we have collected at least 40 objects in each 
database, we can use no more than 12 objects at a time 
in the optimization, as it becomes increasingly difficult 
to handle larger sets. We select the ones used in practice 
as follows. Starting from a set of arbitrarily selected 
seed objects, at every update step we drop those least 
referenced. We then scan the remainder of our objects 
for those whose depth, $D_i$, best matches the current 
depth estimate D (i.e., $(D - D_i)^2$ is smallest, D and $D_i$ 
center aligned), adding them to the database instead of 
those dropped. In practice, a fourth of our objects were 
replaced after the first iteration of our process. 
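
The object update step can be sketched as below. This is a hedged illustration: objects are identified by hypothetical integer ids, all depth maps are assumed to be already center aligned and of equal size, and "the remainder of our objects" is interpreted as the objects that were not active in the current round.

```python
import numpy as np

def update_examples(active, reference_counts, depths, D, n_drop):
    """Replace the n_drop least referenced active objects.

    active: list of object ids currently in the database.
    reference_counts: dict id -> number of patches taken from that object.
    depths: dict id -> depth map, center aligned with the estimate D.
    """
    by_use = sorted(active, key=lambda i: reference_counts.get(i, 0))
    kept = by_use[n_drop:]
    # Only objects that were not active in this round are candidates.
    candidates = [i for i in depths if i not in active]
    ssd = {i: float(((D - depths[i]) ** 2).sum()) for i in candidates}
    added = sorted(candidates, key=ssd.get)[:n_drop]
    return kept + added
```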

5 Preserving global structure 



The scheme described in Sec. 3.1 makes an implicit 
stationarity assumption [30]: Put simply, the probability 
for the depth at any pixel, given those of its neighbors, is 
the same throughout the output image. This is generally 
untrue for structured objects, where depth often depends 
on position. For example, the probability of a pixel's 
depth being "tip-of-the-nose high" is different at 
different locations of a face. To overcome this problem, we 
suggest enforcing non-stationarity by adding additional 
constraints to the patch matching process. Specifically, 
we encourage selection of patches from similar semantic 
parts by favoring patches which match not only in 
intensities and depth, but also in position relative to 
the centroid of the input depth-map. This is achieved 
by adding relative position values to each patch of 
mappings in both the database and the query image. 

Fig. 5. Preserving relative position. (a) Input image. (b) 
Depth estimate without position preservation constraints 
and (c) with them. 

Let p = (x, y) be the coordinates of a pixel in I (the 
center of the patch $W_p$) and let $(x_c, y_c)$ be the coordinates 
of the center of mass of the area occupied by 
non-background depths in the current depth estimate D. We 
add the values $(\delta x, \delta y) = (x - x_c, y - y_c)$ to each such 
patch $W_p$ and similar values to all database patches (i.e., 
by using the center of each depth image $D_i$ for $(x_c, y_c)$). 
These values now force the matching process to find 
patches similar in both mapping and global position. 
Fig. 5 demonstrates a reconstruction result with and 
without these constraints. 
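
Computing the relative position values is straightforward. The sketch below assumes, purely for illustration, that background pixels are marked in the depth map by a sentinel value (the hypothetical `background` parameter); the function name is also ours.

```python
import numpy as np

def relative_positions(D, background=0.0):
    """Per-pixel offsets (dx, dy) from the centroid of non-background depths."""
    ys, xs = np.nonzero(D != background)
    xc, yc = xs.mean(), ys.mean()
    yy, xx = np.mgrid[0:D.shape[0], 0:D.shape[1]]
    return xx - xc, yy - yc
```

The two offset maps can then be treated as two further channels of every patch, alongside intensity and depth, with their own variances in the diagonal matrix of Eq. (2).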

If the query object is segmented from the background, 
an initial estimate for the query's centroid can be ob- 
tained from the foreground pixels. Alternatively, this 
constraint can be applied only after an initial depth 
estimate has been computed (i.e., Sec. 3). 

6 Additional applications 

One of the appealing aspects of this example-based 
approach is that other problems besides depth estimation 
may be similarly solved with little or no change to 
the method we described. Specifically, we have thus 
far considered example mappings from appearances to 
depths. This has allowed us to estimate depths for input 
appearances. We next show how by defining alternative 
mappings we obtain solutions to additional problems 
within the very same framework. 

6.1 Backside reconstruction 

What can be said about the shape of a surface which does 
not appear in the image? Methods for depth estimation 
have predominantly focused on estimating the shapes 
of surfaces visible in images. Here we suggest that an 
input image may contain sufficient cues which, coupled 
with additional examples, may allow us to guess the 
shapes of surfaces even when they are occluded in the 
image. Specifically, we next demonstrate how the shape 
of the occluded backside of an object may be estimated 
by using the same process described thus far, from a 
single image of the object's front. 

We consider a database of mappings containing not 
appearances and their corresponding depths, but rather 
depths and a corresponding second depth layer (or in 
general, multiple depth layers). This additional depth 
layer is taken here to be the depth at the back of the 
objects viewed. Once again, we generate this database 
by applying standard rendering techniques to our 
example 3D geometries. We thus obtain a database 
$S' = \{M'_i\}_{i=1}^{n} = \{(D_i, D'_i)\}_{i=1}^{n}$, where $D'_i$ is the second 
depth layer. 

Having recovered the visible depth of an object (its 
depth map, D), we define the mapping from visible to 
occluded depth as $M'(p) = (D(p), D'(p))$, where D' is 
its second depth layer. Synthesizing D' can now proceed 
similarly to the synthesis of the visible depth layers. We 
note that this second depth layer may indeed have little 
in common with the true depth at the object's back. It is, 
however, reminiscent of the image hole-filling problem 
in attempting to produce a plausible estimate of this 
information, where none other exists. 



6.2 Automatic depth-map colorization 

An additional problem may be solved by reversing 
our original mappings. Here we propose an application 
similar in spirit to the problem faced by a sculptor when 
applying paint to enhance the appearance of statues. 
Given an input depth-map, our goal is to fabricate a 
tailor made image-map for the depth. The motivation for 
doing this comes from the graphics community, where 
considerable effort is put into developing automatic 3D 
colorization techniques. Here we achieve this goal by 
simply switching the roles of the intensities and depths 
in the example mappings: We now use a database 
$S = \{M'_i\}_{i=1}^{n} = \{(D_i, I_i)\}_{i=1}^{n}$. Given an input depth map D, 
our goal is now to produce an image map I such that 
M' = (D, I) consists of feasible mappings from shape to 
intensities. 

We have found that for this particular application, on- 
the-fly database update is unnecessary, as our input is a 
3D shape, allowing us to easily select similar shapes from 
the database before commencing with the optimization. 
We thus choose for synthesis a small number (often as 
small as one or two) of models whose depths best match 
the input depth in the least squares sense. These are used 
throughout the synthesis process. We note that when 
only one database object is used, our method effectively 
morphs its image-map to fit the 3D features of the novel 
input depth (see Sec. 7.2). 




Fig. 6. Depth estimates at multiple resolutions. From 
left to right: input image, five intermediate depth-map 
estimates from different resolutions, and a zoomed-in view 
of our output reconstruction. 

Fig. 7. Database mappings used as examples. In the 
top row, two appearance-depth images, out of the 67 in 
the Fish database. Bottom row, two of 50 pairs from our 
Human-posture database. 



7 Implementation and results 

7.1 Representing mappings 

For the purpose of depth reconstruction, the mapping 
at each pixel in M = (I, D), and similarly every 
$M_i = (I_i, D_i)$, encodes both appearance and depth (see 
examples in Fig. 7). In practice, the appearance component 
of each pixel is its intensity and high frequency values, 
as encoded in the Gaussian and Laplacian pyramids 
of I [31]. We have found direct synthesis of depths to 
result in low frequency noise (e.g., "lumpy" surfaces). 
We thus estimate a Laplacian pyramid of the depth, 
producing the final depth by collapsing the depth 
high-frequency estimates from all scales. In this fashion, 
low frequency depths are synthesized in the coarse 
scale of the pyramid and only sharpened at finer scales 
(see example in Fig. 6). For depth colorization we used 
mappings from depths and depth high frequencies to Y, 
Cb, Cr components, also computed at different scales of 
the Gaussian and Laplacian pyramids. 

Different patch components, including relative 
positions, contribute different amounts of information in 
different classes, as reflected by their different variance 
(i.e., $\Sigma$ in the definition of Sim, Eq. (2)). For example, faces 
are highly structured, thus, position plays an important 
role in their reconstruction. On the other hand, due to 
the variability of human postures, relative position is 
less reliable for that class. We therefore amplify different 
components of each patch, $W_p$, of mappings for 
different classes, by weighting them differently. Sec. 7.2 
presents weights computed automatically for different 
object classes and quantitative results obtained with 
these weights. 

Finally, we note that, in principle, database objects 
may come in any coordinate system, and in particular 
their depth values can be shifted (i.e., $z' = z + z_0$). 
This may pose a problem when combining depths from 
different objects to form a single estimate. A possible 
solution would be to synthesize surface normals instead 
of depths (as in, e.g., [32]). Doing so, however, raises 
the problem of dealing with depth discontinuities. Here 
we chose instead to produce our examples in a common 
frame of reference by setting z = 0 at the centroid of 
the 3D object and performing the reconstruction in this 
common frame of reference. 
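
The pyramid representation used in this section can be illustrated with a small sketch of building and collapsing a Laplacian pyramid. It is not the filter bank of [31]: for brevity, 2 x 2 block averaging stands in for Gaussian smoothing and nearest-neighbor replication stands in for interpolation, and all function names are hypothetical. What the sketch does preserve is the property we rely on, namely that the coarsest level plus the high-frequency bands reconstructs the full-resolution map.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2 x 2 blocks (stand-in for Gaussian blur)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img, shape):
    """Nearest-neighbor upsampling to the given shape."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    pad_h, pad_w = shape[0] - up.shape[0], shape[1] - up.shape[1]
    return np.pad(up, ((0, pad_h), (0, pad_w)), mode="edge")

def laplacian_pyramid(img, levels):
    """Return (bands, coarsest): high-frequency bands, fine to coarse."""
    bands, current = [], np.asarray(img, dtype=float)
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller, current.shape))
        current = smaller
    return bands, current

def collapse(bands, coarsest):
    """Rebuild the full-resolution map, sharpening from coarse to fine."""
    current = coarsest
    for band in reversed(bands):
        current = upsample(current, band.shape) + band
    return current
```

In the setting described above, the coarsest depth level would be synthesized first and each finer band estimated in turn, so that `collapse` only sharpens the low-frequency shape.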



7.2 Experiments 

In our reconstruction and colorization experiments, we
used the following data sets: 52 Hand and 45 Human-
posture objects, produced by exporting built-in models
from the Poser software, 77 busts from the USF head
database [33], and a fish database [34] containing 41
models. In addition, for the colorization experiments
we used a database of five human figures courtesy of
Cyberware [35]. Our objects are stored as textured 3D tri-
angulated meshes. We can thus render them to produce
example mappings using any standard rendering engine.
Example mappings from the fish and human posture
data-sets are displayed in Fig. 7. We used 3D Studio Max
for rendering the images and depth-maps. We preferred
pre-rendering the images and depth maps instead of ren-
dering different objects from different angles on-the-fly.
Thus, we trade rendering time for disk access time
and large storage. Note that this is an implementation
decision; at any one time we load only a small number
of images to memory. The angle update step (Sec. 4)
therefore selects the existing pre-rendered angle closest
to the mean angle.
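
The selection of the closest pre-rendered angle can be sketched as below. This is only an illustration: it assumes a single viewing angle measured in degrees, and the 30 degree rendering grid in the usage note is hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def closest_prerendered_angle(mean_angle, rendered_angles):
    """Return the pre-rendered angle nearest to the mean angle, with wrap-around."""
    return min(rendered_angles, key=lambda a: angular_distance(a, mean_angle))
```

For example, with renderings every 30 degrees, a mean angle of 352 degrees selects the rendering at 0 degrees rather than the one at 330 degrees.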

Depth Reconstruction. Our results include depth esti-
mates from single images for structured objects such
as faces (Figs. 4, 5, and 10) as well as highly non-rigid
objects such as hands (Figs. 1 and 11) and full bodies
(Figs. 6 and 8) in various postures. These results in-
clude in particular objects with higher than zero genus
(e.g., Fig. 8) and objects with depth discontinuities such
as the fingers of the hand in Fig. 1 and the fin of
the left fish in Fig. 9. Additionally, we show that our
method can produce estimates even when the objects in
the image are highly textured, as in the fish examples
in Fig. 9. Similarly to shape-from-shading methods, we
assume here that the query objects were pre-segmented
from their background and then aligned with a single






Fig. 8. Full body depth results. Left to right: Input image, the output depth without and with texture, input image 
of a man, output depth, textured view of the output, output estimate of the depth at the back. Man results shown 
zoomed-in. 




Fig. 9. Two fish depth results. Left to right: Input image (removed from the example database); estimated depth and 
a textured view of the output; Input image; estimated depth and a textured view of the output. 
Fig. 10. Two face depth results. Left to right: Input image,
four most referenced database images in the last iteration, 
our output depth without and with texture, input image, 
four most referenced database images in the last iteration, 
our output depth without and with texture. 



preselected database image to solve for scale and image- 
plane rotation differences (see, e.g., [15]). 

It is interesting to compare our method with a method
tailored to reconstructing face depths [15] (see Fig. 12).
For a non-standard face (a cyclops) our patch-based
method appropriately produces a shape estimate with
only one eye socket, using examples of typical, binocular
faces. Although [15] produces a more finely detailed estimate,
its strong global face shape prior results in an estimate
erroneously containing three separate eyes.

Occluded back estimation results (Sec. 6.1) are pre-
sented in Figs. 8 and 11. Two non-structured objects with
non-trivial backs were selected for these tests.



Fig. 11. Hand depth result. (a) Input image. (b) Our
output. (c) Output estimate for the back of the hand.



In general, the quality of our depth estimate depends
on the database used, the input image, and how the two
match. Fig. 13 presents some failed results. In Fig. 13(a)
our method's lack of a global prior resulted in a middle
finger which points both forward and downwards. In
Fig. 13(b) the subject was wearing a dark shirt and bright
trousers, very different from the uniformly colored ob-
jects in the database.



TABLE 1

Depth estimation database parameters. m - Number of
mappings (objects) M_i used for synthesis. Weights for
intensities, depth, and relative position components.
Patch sizes were 7x7 pixels in all tests.

DB Name         m   Weights
Human-posture   4   0.2140, 0.1116, 0.0092
Hands           5   0.1294, 0.1120, 0.0103
Fish            5   0.1361, 0.1063, 0.0116



Quantitative depth estimation results. To evaluate the
performance of our algorithm, we ran leave-one-out
depth estimation tests on the human-posture, hands, and








Fig. 12. Cyclops depth result. (a) Input image of a
Cyclopean face. (b-c) Top row: Our depth estimate ren-
dered in 3D and a textured view; Bottom row: Depth
estimate produced using the method of [15], rendered
in 3D and a textured view. Both methods use example
shapes of binocular faces. Although [15] produces more
detailed estimates, its strong prior on a face shape
results in a face with three separate eyes.





Fig. 13. Failures. (a) Hand reconstructions are particu-
larly challenging, as they are largely untextured and can
vary greatly in posture. (b) The uniform black shirt differed
greatly from the ones worn by our database objects (see
Fig. 7). No reliable matches could thus be found, resulting
in a lumpy surface. Resulting surface presented from a
zoomed-in view.



fish data sets. We used five randomly selected objects
from each object class as training. Their images and
ground truth depths were used to automatically search
for optimal weights for the three components of the
mappings (appearance, depth, and relative position).
We used a direct simplex search method to search for
these three parameters separately for each class, mini-
mizing the error between our depth estimate and the
known ground truth. The parameters thus obtained are
presented in Table 1. The search for parameters was
performed only once for each class, and the parameters
obtained were applied to all input images. We did not
screen our results for any failures and all depth estimates



TABLE 2

Depth estimation quantitative results. Mean and STD
of L1 distances between estimated depths and
ground truth.

DB Name   Baseline     Make3D [10]   Our method
Posture   .040 ± .01   .248 ± .09    .023 ± .00
Hands     .039 ± .02   .228 ± .05    .026 ± .01
Fish      .044 ± .02   .277 ± .12    .036 ± .02



were included when computing the global result. 
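
The weight search described above can be sketched with a small simplex search. This is an illustration under stated assumptions, not our code: we assume a Nelder-Mead style search (with inside contraction only), and `toy_error` is a hypothetical stand-in for the real objective, which would run the depth estimation on the training objects and measure the error against ground truth.

```python
import numpy as np

def nelder_mead(f, x0, step=0.05, iters=300):
    """Minimal simplex search: reflection, expansion, inside contraction, shrink."""
    n = len(x0)
    simplex = [np.asarray(x0, dtype=float)]
    for i in range(n):
        v = simplex[0].copy()
        v[i] += step          # one extra vertex per parameter
        simplex.append(v)
    for _ in range(iters):
        simplex.sort(key=f)   # best vertex first, worst last
        best, worst = simplex[0], simplex[-1]
        centroid = np.mean(simplex[:-1], axis=0)
        reflected = centroid + (centroid - worst)
        if f(reflected) < f(best):
            expanded = centroid + 2.0 * (centroid - worst)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = centroid + 0.5 * (worst - centroid)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # shrink every vertex halfway towards the best one
                simplex = [best + 0.5 * (v - best) for v in simplex]
    return min(simplex, key=f)

# Hypothetical objective: a quadratic bowl standing in for the depth error
# as a function of the (appearance, depth, relative position) weights.
toy_optimum = np.array([0.2, 0.1, 0.01])

def toy_error(weights):
    return float(np.sum((np.asarray(weights) - toy_optimum) ** 2))

weights = nelder_mead(toy_error, [0.1, 0.1, 0.1])
```

A simplex search of this kind needs only error evaluations and no gradients, which suits an objective computed by running the full reconstruction.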

We next estimated the depths of the objects which
were not included in the training. The quality of our
estimates was compared against a naive selection of the
depths belonging to the database objects most similar
in appearance to the test images. We have included the
accuracy obtained by applying the method of [10], using
their own code (footnote 1). We note, however, that this method
was developed and optimized for outdoor scenes, and so
it is not surprising that it should under-perform when
applied to images of objects. Table 2 summarizes our
results, presenting a comparison of the mean and STD
L1 distances between the ground truth depths and the
estimated depths produced by our method, Make3D [10],
and the naive selection as baseline.
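
The evaluation protocol can be sketched as follows. The details are assumptions made for illustration: the L1 distance is taken as the mean absolute difference over the (optionally masked) object pixels, the baseline picks the single database object with the smallest sum of squared intensity differences, and all function names are hypothetical.

```python
import numpy as np

def l1_depth_error(estimate, truth, mask=None):
    """Mean absolute depth difference, optionally restricted to object pixels."""
    diff = np.abs(np.asarray(estimate, float) - np.asarray(truth, float))
    return float(diff[mask].mean() if mask is not None else diff.mean())

def baseline_depth(query_image, db_images, db_depths):
    """Naive baseline: depth of the database object most similar in appearance."""
    dists = [np.sum((query_image - img) ** 2) for img in db_images]
    return db_depths[int(np.argmin(dists))]

def summarize(errors):
    """Mean and standard deviation of per-object errors, as reported per data set."""
    errors = np.asarray(errors, dtype=float)
    return float(errors.mean()), float(errors.std())
```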

Figs. 14, 15, and 16 present depth estimates obtained
by our own method in these batch tests. The wide
standard deviation of both the baseline and our method,
as reported in Table 2, suggests that these three data
sets do not fully capture the range of shapes and 
appearances of objects in the classes; some objects do 
not have sufficiently similar counterparts in the database 
and consequently their estimates (obtained with our 
method as well as the baseline) were poor. This is not 
surprising considering the nature of the objects included 
in these sets (i.e., non-rigid and textured objects). Paired 
t-tests comparing our algorithm to the baseline method 
show the improved performance of our method to be 
significant for all three data sets, with p < 10^ for the 
human-postures, p < 10^ for the hands and p < 10^ for 
the fish data sets. 
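
For reference, the statistic behind a paired t-test can be computed as in the sketch below. The function name and the numbers in the usage note are hypothetical and are not results from our experiments; obtaining a p-value additionally requires the Student t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired per-object errors of two methods."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # stdev uses the sample (n - 1) normalization, as the paired t-test requires
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

For example, with hypothetical baseline errors [0.040, 0.050, 0.030, 0.045] and method errors [0.020, 0.035, 0.020, 0.020], the statistic is positive, indicating lower errors for the second method on the paired objects.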

Automatic Colorization. Some colorization results are 



presented in Figs. 17-20. Note in particular how pairs
of database image-maps seamlessly mesh to produce
the output image-map in Fig. 20, where the database objects
are presented alongside the output result. Some failed
results are reported in Fig. 21. We believe that the
failed faces were due to the automatic example selection
disregarding the different colors of the selected database
examples. In the case of the fish, the failures were due
to the anomalous shapes of the input depths.

To obtain quantitative results for our colorization
scheme we again ran leave-one-out tests on the face
and fish data sets. Here, since the quality of our
colorization results is subjective, we polled 10 subjects,

1. Make3D code available from http://make3d.cs.cornell.edu




Fig. 14. Hand depth estimates. Four out of the 52 hand-object depth estimates computed using automatically
obtained weights (see Table 1 for parameter values). In both columns from left to right: input image, its ground truth,
four of the five automatically selected database examples used for the reconstructions, and our output estimate.
Fig. 15. Human-posture depth estimates. Four out of the 45 human-posture depth estimates computed using
automatically obtained weights (see Table 1 for parameter values). In both columns from left to right: input image,
its ground truth, the four automatically selected database examples used for the reconstructions, and our output
estimate.



asking "how many image-maps are faulty or otherwise
appear inferior to those in the database". Out of the
57 fish results, on average 28% were found to be faulty.
Similarly, 28% of our face results were found faulty out
of the 76 faces in the database. The parameters used in
these tests are reported in Table 3.

Run-time. Our running time was approximately 7.5 min- 
utes for a 200 x 150 pixel image using 12 example images 
at any one time, on a Pentium 4, 2.8GHz computer with 
2GB of RAM. We used three pyramid levels, each scaled 
to half the size of the previous level. Patch sizes, unless 
otherwise noted, were taken to be 5 x 5 at the coarsest
scale, 7 x 7 at the second level, and 9 x 9 at the finest
scale.

8 Conclusions and future work 

Clearly, having prior knowledge about the shapes of 
objects in the world is beneficial for determining the 
shapes of novel objects. This idea is particularly useful 



TABLE 3

Colorization database parameters. m - Number of
mappings (objects) M_i used for synthesis. k - Patch
width and height, from fine to coarse scale of three
pyramid levels. Weights for depth, depth
high-frequencies, Y, Cb, Cr, and relative position
components. Note that relative position is amplified for
the structured face and humans data-sets. Also, as our
eyes are sensitive to intensities, we amplify Y as well.

DB Name   m   k          Weights
Humans    1   7, 9, 9    0.08, 0.06, 8, 1.1, 1.1, 10
Busts     2   7, 11, 9   0.08, 0.06, 8, 1.1, 1.1, 10
Fish      2   7, 11, 9   0.08, 0.06, 8, 1.1, 1.1, 0.1



when only a single image of the world is available.
Motivated by this basic understanding we formulate an 
algorithm which produces depth estimates from single 
images, given examples of typical related objects. The 
ultimate goal of our algorithm is to produce a depth 







Fig. 16. Fish depth estimates. Four out of the 41 fish depth estimates computed using automatically obtained weights
(see Table 1 for parameter values). From left to right: input image, its ground truth, four of the five automatically selected
database examples used for the reconstructions, and our output estimate.



estimate which is consistent with both the appearance of 
the input image and the known depths in our example 
set. We show how this goal may be formally stated and 
achieved by way of a strong optimization technique. 

At the heart of our method is the realization that the 
problem of depth estimation may be stated as using 
known mappings from appearances to depths to pro- 
duce a new, plausible mapping given a novel appear- 
ance (image). This observation is coupled with the idea 
of storing 3D geometries explicitly and using them to 
render example appearance-depth mappings on-the-fly.
We can thus produce example mappings capturing an 
essentially infinite range of viewing conditions without 
limiting the example set a-priori. 

As a consequence, we obtain an algorithm which is 
versatile in the objects and viewing conditions it can 
be applied to. Moreover, we show that the algorithm 
is versatile in the problems it may be used to solve: 
The general formulation of our mappings allows us to 
estimate additional properties of the objects in the scene, 
in particular, the shape of the occluded back of the object. 

Future work. It seems natural to explore how additional 
information may be estimated using the same frame- 
work. For example, can foreground-background seg- 
mentation be estimated alongside the depth estimation? 
There are additional directions which we feel require 
further study. Chiefly, we would like to explore how the 



method may be improved, both in accuracy and speed. 
Here we would like to capitalize on recent advances in 
image representation and matching, mainly in dense and 
invariant image representations such as [36], [37].

We believe it would also be interesting to explore how 
explicit 3D representations may be further exploited. In 
particular, can an accurate camera viewing position (or 
illumination, or posture etc.) be estimated by a simi- 
lar means of producing novel examples on-the-fly and 
comparing them to the input image? Also, can we learn 
more about the world occluded from view, by using both 
known and example information? 